Subword Retrieval on Biomedical Documents
نویسندگان
چکیده
Document retrieval in languages with a rich and complex morphology – particularly in terms of derivation and (single-word) composition – suffers from serious performance degradation with the stemming-only query-term-totext-word matching paradigm. We propose an alternative approach in which morphologically complex word forms are segmented into relevant subwords (such as stems, prefixes, suffixes), and subwords constitute the basic unit for indexing and retrieval. We evaluate our approach on a biomedical document collection.
منابع مشابه
Multilayer subword units for open-vocabulary spoken document retrieval
This paper describes the application of subword units in an effort of improving open-vocabulary spoken document retrieval performance in the case of highly corrupted recognition output. This paper presents the developed open-vocabulary spoken document retrieval system including the newly proposed subphonetic segment unit and combining multilayer subword units. Our experiments on Japanese spoken...
متن کاملSubword Clusters as Light-Weight Interlingua for Multilingual Document Retrieval
We introduce a light-weight interlingua for a crosslanguage document retrieval system in the medical domain. It is composed of equivalence classes of semantically primitive, language-specific subwords which are clustered by interlingual and intralingual synonymy. Each subword cluster represents a basic conceptual entity of the language-independent interlingua. Documents, as well as queries, are...
متن کاملInformation fusion for spoken document retrieval
In this paper we investigate the fusion of different information sources with the goal of improving performance on spoken document retrieval (SDR) tasks. In particular, we explore the use of multiple transcriptions from different automatic speech recognizers, the combination of different types of subword unit indexing terms, and the combination of word and subword-based units. To perform retrie...
متن کاملMulti-scale-audio indexing for translingual spoken document retrieval
MEI (Mandarin-English Information) is an English-Chinese crosslingual spoken document retrieval (CL-SDR) system developed during the Johns Hopkins University Summer Workshop 2000. We integrate speech recognition, machine translation, and information retrieval technologies to perform CL-SDR. MEI advocates a multi-scale paradigm, where both Chinese words and subwords (characters and syllables) ar...
متن کاملSubword-Based Text Retrieval
Document retrieval in languages with a rich and complex morphology – particularly in terms of derivation and (single-word) composition – suffers from serious performance degradation with the stemming-only query-term-totext-word matching paradigm. We propose an alternative approach in which morphologically complex word forms are segmented into relevant subwords (such as stems, prefixes, suffixes...
متن کامل